METHOD OF REFLECTING TIME/LANGUAGE DISTORTION IN 
OBJECTIVE SPEECH QUALITY ASSESSMENT 

Field of the Invention 

The present invention relates generally to communications systems and, in 
particular, to speech quality assessment. 

Background of the Related Art 

Performance of a wireless communication system can be measured, 
among other things, in terms of speech quality. In the current art, there are two 
techniques of speech quality assessment. The first technique is a subjective technique 
(hereinafter referred to as "subjective speech quality assessment"). In subjective speech 
quality assessment, human listeners are typically used to rate the speech quality of 
processed speech, wherein processed speech is a transmitted speech signal which has 
been processed at the receiver. This technique is subjective because it is based on the 
perception of the individual human, and human assessment of speech quality by native 
listeners, i.e., people who speak the language of the speech material being presented or 
listened to, typically takes into account language effects. Studies have shown that a 
listener's knowledge of language affects the scores in subjective listening tests. Scores 
given by native listeners were lower than scores given by non-native listeners in 
subjective listening tests when language information in the speech was defective, i.e., muted. In a 
normal telephone conversation, the listener is often a native listener. Thus, it is 
preferable to use native listeners for subjective speech quality assessment in order to 
emulate typical conditions. Subjective speech quality assessment techniques provide a 
good assessment of speech quality but can be expensive and time consuming. 

The second technique is an objective technique (hereinafter referred to as 
"objective speech quality assessment"). Objective speech quality assessment is not based 
on the perception of the individual human. Some objective speech quality assessment 
techniques are based on known source speech or reconstructed source speech estimated 
from processed speech. Other objective speech quality assessment techniques are not 
based on known source speech but on processed speech only. These latter techniques are 
referred to herein as "single-ended objective speech quality assessment techniques" and 
are often used when known source speech or reconstructed source speech is unavailable. 

Current single-ended objective speech quality assessment techniques, 
however, do not provide as good an assessment of speech quality as subjective 
speech quality assessment techniques. One reason is that current single-ended objective 
speech quality assessment techniques have been unable to account for language 
effects in their speech quality assessment. 

Accordingly, there exists a need for a single-ended objective speech 
quality assessment technique which accounts for language effects in assessing speech 
quality. 

Summary of the Invention 

The present invention is an objective speech quality assessment technique 
that reflects the impact of distortions which can dominate overall speech quality 
assessment by modeling the impact of such distortions on subjective speech quality 
assessment, thereby accounting for language effects in objective speech quality 
assessment. In one embodiment, the objective speech quality assessment technique of the 
present invention comprises the steps of detecting distortions in an interval of speech 
activity using envelope information, and modifying an objective speech quality 
assessment value associated with the speech activity to reflect the impact of the 
distortions on subjective speech quality assessment. In one embodiment, the objective 
speech quality assessment technique also distinguishes types of distortions, such as short 
bursts, abrupt stops and abrupt starts, and modifies the objective speech quality 
assessment values to reflect the different impacts of each type of distortion on subjective 
speech quality assessment. 

Brief Description of the Drawings 

The features, aspects, and advantages of the present invention will become 
better understood with regard to the following description, appended claims, and 
accompanying drawings where: 
Fig. 1 depicts a flowchart illustrating an objective speech quality assessment 
technique accounting for language effects in accordance with one embodiment of the 
present invention; 

Fig. 2 depicts a flowchart illustrating a voice activity detector (VAD) which 
detects voice activity by examining envelope information associated with the speech 
signal in accordance with one embodiment of the present invention; 

Fig. 3 depicts an example VAD activity diagram illustrating intervals T and G of 
speech and non-speech activities, respectively; 

Fig. 4 depicts a flowchart illustrating an embodiment for determining whether 
speech activity is a short burst or impulsive noise and for modifying objective speech 
frame quality assessment $v_s(m)$ when a short burst or impulsive noise is determined; 

Fig. 5 depicts a flowchart illustrating an embodiment for determining whether 
speech activity has an abrupt stop or mute and for modifying objective speech frame 
quality assessment $v_s(m)$ when it is determined that such speech activity has an abrupt 
stop or mute; and 

Fig. 6 depicts a flowchart illustrating an embodiment for determining whether 
speech activity has an abrupt start and for modifying objective speech frame quality 
assessment $v_s(m)$ when it is determined that such speech activity has an abrupt start. 

Detailed Description 

The present invention is an objective speech quality assessment technique 
that reflects the impact of distortions which can dominate overall speech quality 
assessment by modeling the impact of such distortions on subjective speech quality 
assessment, thereby accounting for language effects in objective speech quality 
assessment. 

Fig. 1 depicts a flowchart 100 illustrating an objective speech quality 
assessment technique accounting for language effects in accordance with one embodiment of 
the present invention. In step 102, speech signal $s(n)$ is processed to determine objective 
speech frame quality assessment $v_s(m)$, i.e., objective quality of speech at frame $m$. In 
one embodiment, each frame $m$ corresponds to a 64 ms interval. The manner of 
processing a speech signal $s(n)$ to obtain objective speech frame quality assessment $v_s(m)$ 
(which does not account for language effects) is well-known in the art. One example of 
such processing is described in co-pending application serial number 10/186,862, entitled 
"Compensation Of Utterance-Dependent Articulation For Speech Quality Assessment", 
filed on July 1, 2002 by inventor Doh-Suk Kim, attached hereto as Appendix A. 

In step 105, speech signal $s(n)$ is analyzed for voice activity by, for 
example, a voice activity detector (VAD). VADs are well-known in the art. Fig. 2 
depicts a flowchart 200 illustrating a VAD which detects voice activity by examining 
envelope information associated with the speech signal in accordance with one 
embodiment of the present invention. In step 205, envelope signals $y_k(n)$ are summed 
over all cochlear channels $k$ to form summed envelope signal $y(n)$ in accordance with 
equation (1): 

$$y(n) = \sum_{k=1}^{N_{cb}} y_k(n) \qquad \text{equation (1)}$$

where $y_k(n) = \sqrt{s_k^2(n) + \hat{s}_k^2(n)}$, $n$ represents a time index, $N_{cb}$ represents a total number of 
critical bands, $s_k(n)$ represents the output of speech signal $s(n)$ through cochlear channel 
$k$, i.e., $s_k(n) = s(n) * h_k(n)$, and $\hat{s}_k(n)$ is the Hilbert transform of $s_k(n)$. 
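
The envelope summation of equation (1) can be illustrated with a minimal sketch. It assumes the cochlear channel outputs $s_k(n)$ are already available (the filters $h_k(n)$ are not modeled here), and it obtains the Hilbert envelope as the magnitude of the analytic signal, computed with a naive DFT for self-containment. The function names `analytic_signal` and `summed_envelope` are hypothetical and are not part of the disclosure.

```python
import cmath
import math


def analytic_signal(x):
    """Analytic signal of x via a naive O(N^2) DFT (illustrative only).

    The real part is x itself and the imaginary part is its Hilbert
    transform, so abs() of each sample is sqrt(s^2 + s_hat^2).
    """
    n = len(x)
    spec = [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]
    for k in range(n):
        if k == 0 or (n % 2 == 0 and k == n // 2):
            gain = 1      # DC and Nyquist bins are kept once
        elif k < (n + 1) // 2:
            gain = 2      # positive frequencies are doubled
        else:
            gain = 0      # negative frequencies are removed
        spec[k] *= gain
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def summed_envelope(channels):
    """y(n): sum over channels k of the Hilbert envelope y_k(n)."""
    envelopes = [[abs(z) for z in analytic_signal(s)] for s in channels]
    return [sum(column) for column in zip(*envelopes)]
```

For two pure-tone channel outputs of amplitude 2 and 3, the summed envelope is approximately constant at 5, which is the behavior one would expect from equation (1).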

In step 210, a frame envelope $e(l)$ is computed every 2 ms by multiplying 
summed envelope signal $y(n)$ with a 4 ms Hamming window $w(n)$ in accordance with 
equation (2): 

$$e(l) = \log\left[\sum_{n=0}^{31} y^{(l)}(n)\, w(n)\right] \qquad \text{equation (2)}$$

where $y^{(l)}(n)$ is the 2 ms $l$-th frame signal of the summed envelope signal $y(n)$. It should 
be understood that the durations of the frame envelope $e(l)$ and Hamming window $w(n)$ 
are merely illustrative and that other durations are possible. In step 215, a flooring 
operation is applied to frame envelope $e(l)$ in accordance with equation (3): 

$$e(l) = \begin{cases} e(l) & \text{if } e(l) > 5 \\ 5 & \text{otherwise} \end{cases} \qquad \text{equation (3)}$$
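
Steps 210 and 215 can be sketched as follows. This is only an illustration under stated assumptions: an 8 kHz sampling rate (so that 2 ms is 16 samples and 4 ms is 32 samples), a natural logarithm, and a frame envelope formed as the logarithm of the Hamming-weighted sum of the summed envelope. None of these choices is confirmed by the text, and the function names are hypothetical.

```python
import math


def hamming(length):
    """Symmetric Hamming window w(n) of the given length."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def floored_frame_envelopes(y, hop=16, win=32, floor=5.0):
    """Frame envelope e(l) every `hop` samples, floored at `floor`.

    hop=16 and win=32 correspond to 2 ms and 4 ms at an ASSUMED 8 kHz rate.
    """
    w = hamming(win)
    envelopes = []
    for start in range(0, len(y) - win + 1, hop):
        total = sum(w[n] * y[start + n] for n in range(win))
        e = math.log(total) if total > 0 else float("-inf")
        envelopes.append(e if e > floor else floor)   # flooring step
    return envelopes
```

With this sketch, a silent input yields the floor value 5 in every frame, so the flooring keeps low-level frames from producing large negative envelope values.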



In step 220, time derivative $\Delta e(l)$ of floored frame envelope $e(l)$ is obtained in 
accordance with equation (4): 

$$\Delta e(l) = \frac{\sum_{j=-3}^{3} j\, e(l+j)}{\sum_{j=-3}^{3} j^{2}} \qquad \text{equation (4)}$$

where $-3 \le j \le 3$. 
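
A minimal sketch of the time derivative of step 220 follows. It assumes that equation (4) has the common regression-slope form, i.e., the sum of $j \cdot e(l+j)$ over $-3 \le j \le 3$ divided by the sum of $j^2$, and it clamps indices at the signal edges. Both the form and the edge handling are assumptions, and `delta_envelope` is a hypothetical name.

```python
def delta_envelope(e, span=3):
    """Regression-slope estimate of the time derivative of e(l).

    Frames outside the signal are replaced by the nearest edge frame
    (an assumption; the text does not specify edge handling).
    """
    denom = sum(j * j for j in range(-span, span + 1))   # 28 when span = 3
    last = len(e) - 1
    deltas = []
    for l in range(len(e)):
        num = sum(j * e[min(max(l + j, 0), last)]
                  for j in range(-span, span + 1))
        deltas.append(num / denom)
    return deltas
```

Under this form, an envelope that rises by one unit per frame gives a delta of 1 away from the edges, and a constant envelope gives a delta of 0.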

In step 225, voice activity detection is performed in accordance with 
equation (5): 

$$\mathrm{vad}(l) = \begin{cases} 1 & \text{if } e(l) > 5 \\ 0 & \text{otherwise} \end{cases} \qquad \text{equation (5)}$$

In step 230, the result of equation (5), i.e., $\mathrm{vad}(l)$, can then be refined based on the 
duration of 1's and 0's in the output. For example, if the duration of 0's in $\mathrm{vad}(l)$ is 
shorter than 8 ms, then $\mathrm{vad}(l)$ shall be changed to 1's for that duration. Similarly, if the 
duration of 1's in $\mathrm{vad}(l)$ is shorter than 8 ms, then $\mathrm{vad}(l)$ shall be changed to 0's for that 
duration. Fig. 3 depicts an example VAD activity diagram 30 illustrating intervals T and 
G of speech and non-speech activities, respectively. It should be understood that speech 
activities associated with intervals T may include, for example, actual speech, data or 
noise. 
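
Steps 225 and 230 can be sketched as below. The sketch assumes one frame envelope every 2 ms, so that 8 ms corresponds to 4 frames, and it fills short gaps of 0's before removing short runs of 1's; the text does not specify the order, so this ordering is an assumption. Short runs at the very start or end of the signal are treated like any other run. All function names are hypothetical.

```python
def run_lengths(bits):
    """Return [value, start, length] for each run of equal values."""
    runs = []
    for i, b in enumerate(bits):
        if runs and runs[-1][0] == b:
            runs[-1][2] += 1
        else:
            runs.append([b, i, 1])
    return runs


def vad(e, threshold=5.0):
    """vad(l) = 1 if e(l) > threshold, else 0."""
    return [1 if x > threshold else 0 for x in e]


def refine(bits, min_frames=4):
    """Flip runs shorter than min_frames (8 ms at an assumed 2 ms hop).

    Short 0-runs are filled first, then short 1-runs are removed.
    """
    out = list(bits)
    for target in (0, 1):
        for value, start, length in run_lengths(out):
            if value == target and length < min_frames:
                out[start:start + length] = [1 - target] * length
    return out


def speech_intervals(bits):
    """(first_frame, last_frame) of each run of 1's, i.e. each interval T."""
    return [(s, s + n - 1) for v, s, n in run_lengths(bits) if v == 1]
```

The intervals returned by `speech_intervals` play the role of the intervals T of Fig. 3, and the gaps between them play the role of the intervals G.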

Returning to flowchart 100 of Fig. 1, upon analyzing speech signal $s(n)$ for 
speech activity, interval T is examined in step 110 to determine whether the associated 
speech activity corresponds to a short burst or impulsive noise. If the speech activity 
in interval T is determined to be a short burst or impulsive noise, then objective speech 
frame quality assessment $v_s(m)$ is modified in step 115 to obtain a modified objective 
speech frame quality assessment $\hat{v}_s(m)$. The modified objective speech frame quality 
assessment $\hat{v}_s(m)$ accounts for the effects of short burst or impulsive noise by modeling 
or simulating the impact of short bursts or impulsive noise on subjective speech quality 
assessment. 

From step 115, or if in step 110 the speech activity in interval T is not 
determined to be a short burst or impulsive noise, flowchart 100 proceeds to step 
120 where the speech activity in interval T is examined to determine whether it has an 
abrupt stop or mute. If the speech activity in interval T is determined to have an abrupt 
stop or mute, then objective speech frame quality assessment $v_s(m)$ is modified in step 
125 to obtain a modified objective speech frame quality assessment $\hat{v}_s(m)$. The modified 
objective speech frame quality assessment $\hat{v}_s(m)$ accounts for the effects of the abrupt 
stop or mute by modeling or simulating the impact of an abrupt stop or mute and 
subsequent release on subjective speech quality assessment. 

From step 125, or if in step 120 the speech activity in interval T is not 
determined to have an abrupt stop or mute, flowchart 100 proceeds to step 130 
where the speech activity in interval T is examined to determine whether it has an abrupt 
start. If the speech activity in interval T is determined to have an abrupt start, then 
objective speech frame quality assessment $v_s(m)$ is modified in step 135 to obtain a 
modified objective speech frame quality assessment $\hat{v}_s(m)$. The modified objective speech frame 
quality assessment $\hat{v}_s(m)$ accounts for the effects of the abrupt start by modeling or 
simulating the impact of an abrupt start on subjective speech quality assessment. From 
step 135, or if in step 130 the speech activity in interval T is not determined to have an 
abrupt start, flowchart 100 proceeds to step 145 where the results of modifications to 
objective speech frame quality assessment $v_s(m)$, if any, are integrated into the original 
objective speech frame quality assessment $v_s(m)$ of step 102. 

Techniques for determining whether speech activity is a short burst (or 
impulsive noise) or has an abrupt stop (or mute) or an abrupt start, i.e., steps 110, 120 and 
130, along with techniques for modifying objective speech frame quality assessment 
$v_s(m)$, i.e., steps 115, 125 and 135, in accordance with one embodiment of the invention 
will now be described. Fig. 4 depicts a flowchart 400 illustrating an embodiment for 
determining whether speech activity is a short burst or impulsive noise and for modifying 
objective speech frame quality assessment $v_s(m)$ when a short burst or impulsive noise is 
determined. In step 405, an impulsive noise frame $l_I$ is determined by finding a frame $l$ in 
interval $T_i$ where frame envelope $e(l)$ is maximum in accordance, for example, with 
equation (6): 

$$l_I = \arg\max_{u_i \le l \le d_i} e(l) \qquad \text{equation (6)}$$

where $u_i$ and $d_i$ represent frames $l$ at the beginning and end of interval $T_i$, respectively. 
In step 410, frame envelope $e(l_I)$ is compared to a listener threshold value indicating 
whether a human listener would consider the corresponding frame $l_I$ an annoying short burst. 
In one embodiment, the listener threshold value is 8; that is, in step 410, $e(l_I)$ is checked 
to determine whether it is greater than 8. If frame envelope $e(l_I)$ is not greater than the 
listener threshold value, then in step 415 the speech activity is determined not to be a 
short burst or impulsive noise. 

If frame envelope $e(l_I)$ is greater than the listener threshold value, then in 
step 420 the duration of interval $T_i$ is checked to determine whether it satisfies both a 
short burst threshold value and a perception threshold value. That is, interval $T_i$ is 
checked to determine whether it is not too short to be perceived by a human 
listener and not too long to be categorized as a short burst. In one embodiment, if the 
duration of interval $T_i$ is greater than or equal to 28 ms and less than or equal to 60 ms, 
i.e., $28 \le T_i \le 60$, then both of the threshold values of step 420 are satisfied. Otherwise the 
threshold values of step 420 are not satisfied. If the threshold values of step 420 are not 
satisfied, then in step 425 the speech activity is determined not to be a short burst or 
impulsive noise. 
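
The screening of steps 405 through 420 can be sketched as follows, using the illustrative threshold values of this embodiment (listener threshold 8, duration between 28 ms and 60 ms). The sketch assumes one frame envelope every 2 ms when converting a frame count to a duration. The function name and the inclusive frame indices `u` and `d` are hypothetical.

```python
FRAME_MS = 2  # ASSUMED spacing between consecutive frame envelopes


def passes_burst_screen(e, u, d, listener_thr=8.0, min_ms=28, max_ms=60):
    """Return (peak_frame, ok) for the interval of frames u..d inclusive.

    ok is True only if the peak envelope exceeds the listener threshold
    (step 410) and the interval duration lies in [min_ms, max_ms] (step 420).
    """
    peak = max(range(u, d + 1), key=lambda l: e[l])   # arg max of e(l)
    if e[peak] <= listener_thr:
        return peak, False                            # step 415
    duration_ms = (d - u + 1) * FRAME_MS
    return peak, min_ms <= duration_ms <= max_ms      # step 420 / 425
```

An interval that passes this screen is only a candidate; the later steps of flowchart 400 can still reject it.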

If the threshold values of step 420 are satisfied, then in step 430 a 
maximum delta frame envelope $\Delta e(l)$ is determined from the frame envelopes $e(l)$ in the 
one or more frames prior to the beginning of interval $T_i$ through the first one or more 
frames of interval $T_i$ and subsequently compared to an abrupt change threshold value, 
such as 0.25. The abrupt change threshold value represents a criterion for identifying an 
abrupt change in the frame envelope. In one embodiment, a maximum delta frame 
envelope $\Delta e(l)$ is determined from frame envelope $e(u_i - 1)$, i.e., the frame envelope 
immediately preceding interval $T_i$, through frame envelope $e(u_i + 5)$, i.e., the fifth frame 
envelope in interval $T_i$, and compared to a threshold value of 0.25; that is, in step 430, it 
is checked whether equation (7) is satisfied: 

$$\max_{u_i - 1 \le l \le u_i + 5} \Delta e(l) > 0.25 \qquad \text{equation (7)}$$

If the maximum delta frame envelope $\Delta e(l)$ does not exceed the threshold value, then in 
step 435 the speech activity is determined not to be a short burst or impulsive noise. 
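
The check of step 430 reduces to a maximum over a short window of delta frame envelopes around the start of the interval. A minimal sketch, assuming the window runs from one frame before the first frame `u` of the interval through frame `u + 5` and is clipped to the available frames (the clipping is an assumption; the function name is hypothetical):

```python
def has_abrupt_change(de, u, threshold=0.25, before=1, after=5):
    """True if the largest delta envelope in frames u-before..u+after
    exceeds the abrupt change threshold."""
    lo = max(u - before, 0)
    hi = min(u + after, len(de) - 1)
    return max(de[lo:hi + 1]) > threshold
```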

If the maximum delta frame envelope $\Delta e(l)$ does exceed the threshold 
value, then in step 440 it is determined whether frame $m_I$ would be sufficiently annoying 
to a human listener, where $m_I$ corresponds to the frame $m$ which is impacted most by 
impulsive noise frame $l_I$. In one embodiment, step 440 is achieved by comparing 
a ratio of objective speech frame quality assessment $v_s(m_I)$ to modulation noise 
reference unit $v_q(m_I)$ against a noise threshold value. Step 440 may be expressed, for 
example, using a noise threshold value of 1.1 and equation (8): 

$$\frac{v_s(m_I)}{v_q(m_I)} < 1.1 \qquad \text{equation (8)}$$

wherein if equation (8) is satisfied, it would be determined that frame $m_I$ is sufficiently 
annoying to a human listener. If it is determined that objective speech frame quality 
assessment $v_s(m_I)$ would be sufficiently annoying to a human listener, then in step 445 the 
speech activity is determined not to be a short burst or impulsive noise. 

If it is determined that objective speech frame quality assessment $v_s(m_I)$ 
would not be sufficiently annoying to a human listener, then in step 450 conditions 
related to the durations of intervals $G_{i-1,i}$, $G_{i,i+1}$, $T_{i-1}$ and/or $T_{i+1}$ satisfying certain 
minimum or maximum duration threshold values are checked to determine whether the 
speech activity belongs to human speech. In one embodiment, the conditions of step 450 
are expressed as equations (9) and (10): 

$$G_{i-1,i} < 180 \text{ ms and } G_{i,i+1} > 40 \text{ ms and } T_{i-1} > 50 \text{ ms} \qquad \text{equation (9)}$$

$$G_{i-1,i} > 40 \text{ ms and } G_{i,i+1} < 100 \text{ ms and } T_{i+1} > 60 \text{ ms} \qquad \text{equation (10)}$$

If any of these equations or conditions are satisfied, then in step 455 the speech activity is 
determined not to be a short burst or impulsive noise. Rather, the speech activity is 
determined to be natural speech. It should be understood that the minimum and 
maximum duration threshold values used in equations (9) and (10) are merely illustrative 
and may be different. 
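
The natural-speech test of step 450 can be sketched as a pair of duration comparisons. The sketch assumes that equation (9) involves the gap before the interval, the gap after it, and the preceding speech interval, and that equation (10) involves the same two gaps and the following speech interval; the subscripts are partly illegible in the source, so this reading is an assumption. All durations are in milliseconds and the names are hypothetical.

```python
def looks_like_natural_speech(gap_before, gap_after, t_prev, t_next):
    """True if either duration condition of step 450 holds.

    gap_before / gap_after: non-speech gaps adjacent to the interval (ms).
    t_prev / t_next: durations of the neighboring speech intervals (ms).
    """
    cond_9 = gap_before < 180 and gap_after > 40 and t_prev > 50
    cond_10 = gap_before > 40 and gap_after < 100 and t_next > 60
    return cond_9 or cond_10
```

Read this way, a short burst of activity that sits close to a longer stretch of speech is treated as part of natural speech rather than as impulsive noise.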

If none of the conditions in step 450 are satisfied, then in step 460 
objective speech frame quality assessment $v_s(m)$ is modified in accordance with equation 
(11): 

$$\hat{v}_s(m) = \frac{v_s(m)}{1 + \exp\left[-8.2\,(m - m_I)/e(l_I) - 10\right]} \qquad \text{equation (11)}$$

Fig. 5 depicts a flowchart 500 illustrating an embodiment for determining 
whether speech activity has an abrupt stop or mute and for modifying objective speech 
frame quality assessment $v_s(m)$ when it is determined that such speech activity has an 
abrupt stop or mute. In step 505, abrupt stop frame $l_M$ is determined. The abrupt stop 
frame $l_M$ is determined by first finding negative peaks of delta frame envelope $\Delta e(l)$ in the 
speech activity using all frames $l$ in interval $T_i$. Delta frame envelope $\Delta e(l)$ has a 
negative peak at $l$ if $\Delta e(l) < \Delta e(l+j)$ for $-3 \le j \le 3$, $j \ne 0$. Upon finding the negative peaks, abrupt 
stop frame $l_M$ is determined as the minimum of the negative peaks of delta frame 
envelopes $\Delta e(l)$. In step 510, delta frame envelope $\Delta e(l_M)$ is checked to determine 
whether an abrupt stop threshold value is satisfied. The abrupt stop threshold 
represents a criterion for determining whether there was sufficient negative change in 
frame envelope from one frame $l$ to another frame $l+1$ to be considered an abrupt stop. In 
one embodiment, the abrupt stop threshold value is -0.56 and step 510 may be expressed 
as equation (12): 

$$\Delta e(l_M) < -0.56 \qquad \text{equation (12)}$$

If delta frame envelope $\Delta e(l_M)$ does not satisfy the abrupt stop threshold value, then in 
step 515 the speech activity is determined not to have an abrupt stop or mute. 
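
Steps 505 and 510 can be sketched as a local-minimum search followed by a threshold test. The sketch assumes the peak comparison covers the three frames on each side of a frame (excluding the frame itself) and ignores neighbors that fall outside the signal; these details are assumptions, and the function name is hypothetical.

```python
def abrupt_stop_frame(de, u, d, span=3, stop_thr=-0.56):
    """Return (stop_frame, satisfied) for the interval of frames u..d.

    A frame is a negative peak if its delta envelope is strictly below
    that of every neighbor within `span` frames. The stop frame is the
    negative peak with the smallest delta envelope.
    """
    peaks = []
    for l in range(u, d + 1):
        neighbours = [de[l + j] for j in range(-span, span + 1)
                      if j != 0 and 0 <= l + j < len(de)]
        if all(de[l] < x for x in neighbours):
            peaks.append(l)
    if not peaks:
        return None, False
    stop = min(peaks, key=lambda l: de[l])
    return stop, de[stop] < stop_thr
```

The abrupt start frame of Fig. 6 can be found in the same way by searching for positive peaks and taking the largest one.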

If delta frame envelope $\Delta e(l_M)$ does satisfy the abrupt stop threshold value, 
then in step 520 interval $T_i$ is checked to determine if the speech activity is of sufficient 
duration, e.g., longer than a short burst. In one embodiment, the duration of interval $T_i$ is 
checked to see if it exceeds the duration threshold value, e.g., 60 ms. That is, if $T_i \le 60$ 
ms, then the speech activity associated with interval $T_i$ is not of sufficient duration. If the 
speech activity is considered not of sufficient duration, then in step 525 the speech 
activity is determined not to have an abrupt stop or mute. 

If the speech activity is considered of sufficient duration, then in step 530 
a maximum frame envelope $e(l)$ is determined for one or more frames prior to frame $l_M$ 
through frame $l_M$ or beyond and subsequently compared against a stop-energy threshold 
value. The stop-energy threshold value represents a criterion for determining whether a 
frame envelope has sufficient energy prior to muting. In one embodiment, maximum 
frame envelope $e(l)$ is determined for frames $l_M - 7$ through $l_M$ and compared to a stop-energy 
threshold value of 9.5, i.e., $\max_{l_M - 7 \le l \le l_M} e(l) > 9.5$. If the maximum frame envelope $e(l)$ 
does not satisfy the stop-energy threshold value, then in step 535 the speech activity is 
determined not to have an abrupt stop or mute. 

If the maximum frame envelope $e(l)$ does satisfy the stop-energy threshold 
value, then objective speech frame quality assessment $v_s(m)$ is modified in accordance 
with equation (13) for several frames $m$, such as $m_M, \ldots, m_M + 6$: 

$$\hat{v}_s(m) = \left|\Delta e(l_M)\right| \cdot \frac{-6}{1 + \exp\left[-2\,(m - m_M) - 3\right]} \qquad \text{equation (13)}$$

where $m_M$ corresponds to the frame $m$ which is impacted most by abrupt stop frame $l_M$. 

Fig. 6 depicts a flowchart 600 illustrating an embodiment for determining 
whether speech activity has an abrupt start and for modifying objective speech frame 
quality assessment $v_s(m)$ when it is determined that such speech activity has an abrupt 
start. In step 605, abrupt start frame $l_S$ is determined. The abrupt start frame $l_S$ is 
determined by first finding positive peaks of delta frame envelope $\Delta e(l)$ in the speech 
activity using all frames $l$ in interval $T_i$. Delta frame envelope $\Delta e(l)$ has a positive peak at 
$l$ if $\Delta e(l) > \Delta e(l+j)$ for $-3 \le j \le 3$, $j \ne 0$. Upon finding the positive peaks, abrupt start frame $l_S$ is 
determined as the maximum of the positive peaks of delta frame envelopes $\Delta e(l)$. In step 
610, delta frame envelope $\Delta e(l_S)$ is checked to determine whether an abrupt start 
threshold value is satisfied. The abrupt start threshold represents a criterion for 
determining whether there was sufficient positive change in frame envelope from one 
frame $l$ to another frame $l+1$ to be considered an abrupt start. In one embodiment, the 
abrupt start threshold value is 0.9 and step 610 may be expressed as equation (14): 

$$\Delta e(l_S) > 0.9 \qquad \text{equation (14)}$$

If delta frame envelope $\Delta e(l_S)$ does not satisfy the abrupt start threshold value, then in 
step 615 the speech activity is determined not to have an abrupt start. 

If delta frame envelope $\Delta e(l_S)$ does satisfy the abrupt start threshold value, 
then in step 620 interval $T_i$ is checked to determine if the speech activity is of sufficient 
duration, e.g., longer than a short burst. In one embodiment, the duration of interval $T_i$ is 
checked to see if it exceeds the short burst threshold value, e.g., 60 ms. That is, if $T_i \le 60$ 
ms, then the speech activity associated with interval $T_i$ is not of sufficient duration. If the 
speech activity is not of sufficient duration, then in step 625 the speech activity is 
determined not to have an abrupt start. 

If the speech activity is of sufficient duration, then in step 630 a maximum 
frame envelope $e(l)$ is determined for frame $l_S$ or prior through one or more frames after 
frame $l_S$ and subsequently compared against a start-energy threshold value. The start-energy 
threshold value represents a criterion for determining whether a frame envelope 
has sufficient energy. In one embodiment, maximum frame envelope $e(l)$ is determined 
for frames $l_S$ through $l_S + 7$ and compared to a start-energy threshold value of 12, i.e., 
$\max_{l_S \le l \le l_S + 7} e(l) < 12$. If the maximum frame envelope $e(l)$ does not satisfy the start-energy 
threshold value, then in step 635 the speech activity is determined not to have an abrupt 
start. 

If the maximum frame envelope $e(l)$ does satisfy the start-energy threshold 
value, then objective speech frame quality assessment $v_s(m)$ is modified in accordance 
with equation (16) for several frames $m$, such as $m_S, \ldots, m_S + 6$: 

$$\hat{v}_s(m) = \frac{v_s(m)}{1 + \exp\left[-0.4\,(m - m_S)/\Delta e(l_S) - 10\right]} \qquad \text{equation (16)}$$

where $m_S$ corresponds to the frame $m$ which is impacted most by abrupt start frame $l_S$. 
It should be understood that the values used in equations (11), (13) and (16) were derived 
empirically. Other values are possible. Thus, the present invention should not be limited 
to those specific values. 

Note that upon determining modified objective speech frame quality 
assessment $\hat{v}_s(m)$, the integration performed in step 145 may be achieved using equation 
(17): 

$$v_s(m) = \min\left(v_{s,I}(m),\, v_{s,M}(m),\, v_{s,S}(m)\right) \qquad \text{equation (17)}$$

where $v_{s,I}(m)$, $v_{s,M}(m)$ and $v_{s,S}(m)$ correspond to the modified objective speech frame 
quality assessment $\hat{v}_s(m)$ of equations (11), (13) and (16), respectively. 
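
The integration of equation (17) is a frame-wise minimum. A minimal sketch, assuming each of the three sequences holds one value per frame $m$ and that frames not affected by a given distortion simply carry the unmodified assessment in that sequence (an assumption; the function name is hypothetical):

```python
def integrate(v_impulse, v_stop, v_start):
    """Frame-wise minimum of the three modified assessment sequences."""
    return [min(a, b, c) for a, b, c in zip(v_impulse, v_stop, v_start)]
```

Taking the minimum means that, for each frame, the most severe of the modeled distortions determines the integrated value.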

Although the present invention has been described in considerable detail 
with reference to certain embodiments, other versions are possible. For example, the 
order of the steps in the flowcharts may be re-arranged, or some steps (or criteria) may 
be deleted from or added to the flowcharts. Therefore, the spirit and scope of the present 
invention should not be limited to the description of the embodiments contained herein. 
It should also be understood by those skilled in the art that the present invention may be 
implemented either as hardware or software incorporated into some type of processor. 

Claims 

I claim: 

1. A method for objectively assessing speech quality comprising the steps of: 

detecting distortions in an interval of speech activity using envelope 
information; and 

modifying an objective speech quality assessment value associated with 
the speech activity to reflect the impact of the distortions on subjective speech 
quality assessment. 

2. The method of claim 1, wherein the step of modifying includes the step of 
determining the objective speech quality assessment values for the speech 
activity. 

3. The method of claim 1, wherein the distortions being detected are impulsive noise, 
abrupt stop or abrupt start. 

4. The method of claim 1, wherein the step of detecting includes the step of 
determining a distortion type. 

5. The method of claim 4, wherein the distortion type is determined to be impulsive 
noise if the envelope information indicates that the speech activity can be 
perceived by a human listener to be noise and if the interval is of a duration long 
enough to be perceived by a human listener but not too long for a short burst. 

6. The method of claim 4, wherein the distortion type is determined to be impulsive 
noise if the envelope information indicates that the speech activity can be 
perceived by a human listener to be noise, if a ratio of the objective speech quality 
assessment value to a modulation noise reference unit indicates a human listener 
would perceive annoying noise, and if the interval is of a duration long enough to 
be perceived by a human listener but not too long for a short burst. 

7. The method of claim 4, wherein the objective quality assessment value associated 
with the speech activity is modified in accordance with the following equation to 
obtain a modified objective quality assessment value if the distortion type is 
impulsive noise: 

$$\hat{v}_s(m) = \frac{v_s(m)}{1 + \exp\left[-8.2\,(m - m_I)/e(l_I) - 10\right]}$$

where $v_s(m)$ is the objective quality assessment value and $\hat{v}_s(m)$ is the modified 
objective quality assessment value. 



8. The method of claim 4, wherein the distortion type is determined to be abrupt stop 
if the envelope information indicates that there was a sufficient negative change 
in frame energy from one frame to another to be considered an abrupt stop and if 
the interval is of a duration longer than a short burst. 

9. The method of claim 4, wherein the distortion type is determined to be abrupt stop 
if the envelope information indicates that a maximum frame envelope had 
sufficient energy prior to ending the interval, and if the interval is of a duration 
longer than a short burst. 



10. The method of claim 4, wherein the objective quality assessment value associated 
with the speech activity is modified in accordance with the following equation to 
obtain a modified objective quality assessment value if the distortion type is 
impulsive noise: 

$$\hat{v}_s(m) = \left|\Delta e(l_M)\right| \cdot \frac{-6}{1 + \exp\left[-2\,(m - m_M) - 3\right]}$$

where $v_s(m)$ is the objective quality assessment value and $\hat{v}_s(m)$ is the modified 
objective quality assessment value. 



11. The method of claim 4, wherein the distortion type is determined to be abrupt start 
if the envelope information indicates that there was a sufficient positive change 
in frame energy from one frame to another to be considered an abrupt start and if 
the interval is of a duration longer than a short burst. 

12. The method of claim 4, wherein the distortion type is determined to be abrupt stop 
if the envelope information indicates that a maximum frame envelope had 
sufficient energy towards a beginning of the interval, and if the interval is of a 
duration longer than a short burst. 



13. The method of claim 4, wherein the objective quality assessment value associated 
with the speech activity is modified in accordance with the following equation to 
obtain a modified objective quality assessment value if the distortion type is 
impulsive noise: 

$$\hat{v}_s(m) = \frac{v_s(m)}{1 + \exp\left[-0.4\,(m - m_S)/\Delta e(l_S) - 10\right]}$$

where $v_s(m)$ is the objective quality assessment value and $\hat{v}_s(m)$ is the modified 
objective quality assessment value. 



14. The method of claim 1 comprising the additional step of: 

prior to the step of detecting, determining the interval of speech activity 
using the envelope information. 

15. An objective speech quality assessment system comprising: 

means for detecting distortions in an interval of speech activity using 
envelope information; and 

means for modifying an objective speech quality assessment value 
associated with the speech activity to reflect the impact of the distortions on 
subjective speech quality assessment. 

16. The objective speech quality assessment system of claim 15, wherein the means 
for modifying includes a means for determining the objective speech quality 
assessment values without accounting for distortions for the speech activity. 

17. The objective speech quality assessment system of claim 15, wherein the 
distortions being detected are impulsive noise, abrupt stop or abrupt start. 

18. The objective speech quality assessment system of claim 15, wherein the means 
for detecting includes a means for determining a distortion type. 

19. The objective speech quality assessment system of claim 18, wherein the means 
for detecting includes a voice activity detector for detecting intervals of speech 
activity, wherein the means for determining a distortion type examines intervals of 
speech activities detected by the voice activity detector. 

Abstract of the Disclosure 

Disclosed is an objective speech quality assessment technique that reflects 
the impact of distortions which can dominate overall speech quality assessment by 
modeling the impact of such distortions on subjective speech quality assessment, thereby 
accounting for language effects in objective speech quality assessment. 

COMPENSATION FOR UTTERANCE DEPENDENT ARTICULATION 
FOR SPEECH QUALITY ASSESSMENT 

Field of the Invention 

The present invention relates generally to communications systems and, in 
particular, to speech quality assessment. 

Background of the Related Art 

Performance of a wireless communication system can be measured, 
among other things, in terms of speech quality. In the current art, there are two 
techniques of speech quality assessment. The first technique is a subjective technique 
(hereinafter referred to as "subjective speech quality assessment"). In subjective speech 
quality assessment, human listeners are used to rate the speech quality of processed 
speech, wherein processed speech is a transmitted speech signal which has been 
processed at the receiver. This technique is subjective because it is based on the 
perception of the individual human, and human assessment of speech quality typically 
takes into account phonetic contents, speaking styles or individual speaker differences. 
Subjective speech quality assessment can be expensive and time consuming. 

The second technique is an objective technique (hereinafter referred to as
"objective speech quality assessment"). Objective speech quality assessment is not based
on the perception of the individual human. Most objective speech quality assessment
techniques are based on known source speech or reconstructed source speech estimated
from processed speech. However, these objective techniques do not account for phonetic
contents, speaking styles or individual speaker differences.

Accordingly, there exists a need for an objective speech quality assessment
technique which takes into account phonetic contents, speaking styles or individual
speaker differences.

Summary of the Invention 
The present invention is a method for objective speech quality assessment
that accounts for phonetic contents, speaking styles or individual speaker differences by
distorting speech signals under speech quality assessment. By using a distorted version
of a speech signal, it is possible to compensate for different phonetic contents, different
individual speakers and different speaking styles when assessing speech quality. The
amount of degradation in the objective speech quality assessment caused by distorting the
speech signal remains similar across different speech signals, especially when the
distortion applied is severe. Objective speech
quality assessments for the distorted speech signal and the original undistorted speech
signal are compared to obtain a speech quality assessment compensated for utterance
dependent articulation. In one embodiment, the comparison corresponds to a difference
between the objective speech quality assessments for the distorted and undistorted speech
signals.

Brief Description of the Drawings 

The features, aspects, and advantages of the present invention will become
better understood with regard to the following description, appended claims, and
accompanying drawings where:

Fig. 1 depicts an objective speech quality assessment arrangement which
compensates for utterance dependent articulation in accordance with the present
invention;

Fig. 2 depicts an embodiment of an objective speech quality assessment module
employing an auditory-articulatory analysis module in accordance with the present
invention;

Fig. 3 depicts a flowchart for processing, in an articulatory analysis module, the
plurality of envelopes a_i(t) in accordance with one embodiment of the invention; and

Fig. 4 depicts an example illustrating a modulation spectrum A_i(m,f) in terms of
power versus frequency.

Detailed Description 

The present invention is a method for objective speech quality assessment
that accounts for phonetic contents, speaking styles or individual speaker differences by
distorting processed speech. Objective speech quality assessments tend to yield different
values for different speech signals which have the same subjective speech quality scores.
These values differ because of different distributions of spectral contents in the
modulation spectral domain. By using a distorted version of a processed speech signal, it
is possible to compensate for different phonetic contents, different individual speakers
and different speaking styles. The amount of degradation in the objective speech quality
assessment caused by distorting the speech signal remains similar across different speech
signals, especially when the distortion is severe. Objective speech quality assessments for
the distorted speech signal and the original undistorted speech signal are compared to
obtain a speech quality assessment compensated for utterance dependent articulation.

Fig. 1 depicts an objective speech quality assessment arrangement 10 
which compensates for utterance dependent articulation in accordance with the present 
invention. Objective speech quality assessment arrangement 10 comprises a plurality of 
objective speech quality assessment modules 12, 14, a distortion module 16 and a 
compensation utterance-specific bias module 18. Speech signal s(t) is provided as input
to both distortion module 16 and objective speech quality assessment module 12. In distortion
module 16, speech signal s(t) is distorted to produce a modulated noise reference unit 
(MNRU) speech signal s'(t). In other words, distortion module 16 produces a noisy 
version of input signal s(t). MNRU speech signal s'(t) is then provided as input to 
objective speech quality assessment module 14. 
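
As a rough illustration of what distortion module 16 might do, the sketch below applies multiplicative, signal-correlated noise in the style of a modulated noise reference unit, s'(n) = s(n)[1 + 10^(-Q/20) N(n)]. The function name `mnru_distort`, the parameter `q_db`, the Gaussian noise source and the fixed seed are assumptions made for illustration only; the description does not state how module 16 is implemented.

```python
import random

def mnru_distort(signal, q_db, seed=0):
    """Return a noisy copy of `signal` with signal-correlated noise.

    Each sample is scaled by 1 + 10**(-q_db / 20) * n, where n is
    unit-variance Gaussian noise. A smaller q_db gives a more severe
    distortion (assumed convention).
    """
    rng = random.Random(seed)        # fixed seed keeps the sketch reproducible
    gain = 10.0 ** (-q_db / 20.0)    # noise gain relative to the signal
    return [x * (1.0 + gain * rng.gauss(0.0, 1.0)) for x in signal]

s = [0.0, 0.5, -0.25, 1.0]           # toy stand-in for speech signal s(t)
s_prime = mnru_distort(s, q_db=10.0) # stand-in for MNRU speech signal s'(t)
```

Because the noise is multiplicative, silent samples stay silent, and a very large `q_db` leaves the signal essentially unchanged.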

In objective speech quality assessment modules 12, 14, speech signal s(t)
and MNRU speech signal s'(t) are processed to obtain objective speech quality
assessments SQ(s(t)) and SQ(s'(t)). Objective speech quality assessment modules 12, 14
are essentially identical in terms of the type of processing performed on any input speech
signals. That is, if both objective speech quality assessment modules 12, 14 receive the
same input speech signal, the output signals of both modules 12, 14 would be
approximately identical. Note that, in other embodiments, objective speech quality
assessment modules 12, 14 may process speech signals s(t) and s'(t) in a manner different
from each other. Objective speech quality assessment modules are well-known in the art.
An example of such a module will be described later herein.

Objective speech quality assessments SQ(s(t)) and SQ(s'(t)) are then
compared to obtain speech quality assessment SQ_compensated, which compensates for
utterance dependent articulation. In one embodiment, speech quality assessment
SQ_compensated is determined using the difference between objective speech quality
assessments SQ(s(t)) and SQ(s'(t)). For example, SQ_compensated is equal to SQ(s(t)) minus
SQ(s'(t)), or vice-versa. In another embodiment, speech quality assessment SQ_compensated
is determined based on a ratio between objective speech quality assessments SQ(s(t)) and
SQ(s'(t)). For example,

$$SQ_{compensated} = \frac{SQ(s(t)) + \mu}{SQ(s'(t)) + \mu} \quad \text{or} \quad SQ_{compensated} = \frac{SQ(s'(t)) + \mu}{SQ(s(t)) + \mu}$$

where μ is a small constant value.
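
The two comparisons just described, a difference and a ratio stabilized by a small constant, can be sketched as follows. The function name `compensate`, the `mode` argument and the default value of `mu` are hypothetical choices for illustration and are not taken from the description.

```python
def compensate(sq_clean, sq_distorted, mode="difference", mu=1e-6):
    """Combine SQ(s(t)) and SQ(s'(t)) into a compensated score."""
    if mode == "difference":
        return sq_clean - sq_distorted
    if mode == "ratio":
        # mu keeps the quotient defined when the denominator score is zero
        return (sq_clean + mu) / (sq_distorted + mu)
    raise ValueError("mode must be 'difference' or 'ratio'")
```

Swapping the two arguments gives the "vice-versa" difference and the second form of the ratio.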

As mentioned earlier, objective speech quality assessment modules 12, 14
are well known in the art. Fig. 2 depicts an embodiment 20 of an objective speech
quality assessment module 12, 14 employing an auditory-articulatory analysis module in
accordance with the present invention. As shown in Fig. 2, objective quality assessment
module 20 comprises cochlear filterbank 22, envelope analysis module 24 and
articulatory analysis module 26. In objective quality assessment module 20, speech
signal s(t) is provided as input to cochlear filterbank 22. Cochlear filterbank 22
comprises a plurality of cochlear filters h_i(t) for processing speech signal s(t) in
accordance with a first stage of a peripheral auditory system, where i = 1, 2, ..., N_c represents
a particular cochlear filter channel and N_c denotes the total number of cochlear filter
channels. Specifically, cochlear filterbank 22 filters speech signal s(t) to produce a
plurality of critical band signals s_i(t), wherein critical band signal s_i(t) is equal to
s(t)*h_i(t).
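
Reading the asterisk as convolution, the filterbank stage can be sketched in discrete time as below. The impulse responses shown are toy placeholders, not cochlear filters, and the helper names `convolve` and `filterbank` as well as the trimming of each output to the input length are assumptions made for illustration.

```python
def convolve(s, h):
    """Full discrete convolution of sequence s with impulse response h."""
    out = [0.0] * (len(s) + len(h) - 1)
    for n, s_n in enumerate(s):
        for k, h_k in enumerate(h):
            out[n + k] += s_n * h_k
    return out

def filterbank(s, impulse_responses):
    """Return one band signal s_i = s * h_i per channel, trimmed to len(s)."""
    return [convolve(s, h)[:len(s)] for h in impulse_responses]

# Two toy impulse responses standing in for cochlear filters h_1 and h_2.
toy_filters = [[1.0], [0.5, 0.5]]
bands = filterbank([1.0, 2.0, 3.0], toy_filters)
```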

The plurality of critical band signals s_i(t) is provided as input to envelope
analysis module 24. In envelope analysis module 24, the plurality of critical band signals
s_i(t) is processed to obtain a plurality of envelopes a_i(t), wherein

$$a_i(t) = \sqrt{s_i^2(t) + \hat{s}_i^2(t)}$$

and ŝ_i(t) is the Hilbert transform of s_i(t).
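
One common way to obtain such an envelope numerically is through the analytic signal, whose magnitude equals the square root of the signal squared plus its Hilbert transform squared. The sketch below does this with a direct discrete Fourier transform so that it needs only the standard library. The function names and the choice of a frequency-domain Hilbert transform are assumptions, not details given in the description.

```python
import cmath
import math

def dft(x, inverse=False):
    """Direct O(N^2) discrete Fourier transform (inverse includes 1/N)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
               for t in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out

def envelope(s_i):
    """Return a_i = sqrt(s_i^2 + hilbert(s_i)^2) via the analytic signal."""
    n = len(s_i)
    spectrum = dft(s_i)
    # Keep DC (and Nyquist for even n), double positive frequencies,
    # and zero the negative frequencies.
    for k in range(n):
        if k == 0 or (n % 2 == 0 and k == n // 2):
            continue
        spectrum[k] *= 2.0 if k < (n + 1) // 2 else 0.0
    analytic = dft(spectrum, inverse=True)   # s_i + j * hilbert(s_i)
    return [abs(z) for z in analytic]

# A pure tone with amplitude 0.8 should have a flat envelope of 0.8.
tone = [0.8 * math.cos(2 * math.pi * 3 * t / 32) for t in range(32)]
env = envelope(tone)
```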

The plurality of envelopes a_i(t) is then provided as input to articulatory
analysis module 26. In articulatory analysis module 26, the plurality of envelopes a_i(t) is
processed to obtain a speech quality assessment for speech signal s(t). Specifically,
articulatory analysis module 26 compares the power associated with signals
generated from the human articulatory system (hereinafter referred to as "articulation
power P_A(m,i)") with the power associated with signals not generated from the human
articulatory system (hereinafter referred to as "non-articulation power P_NA(m,i)"). Such
comparison is then used to make a speech quality assessment.

Fig. 3 depicts a flowchart 300 for processing, in articulatory analysis
module 26, the plurality of envelopes a_i(t) in accordance with one embodiment of the
invention. In step 310, a Fourier transform is performed on frame m of each of the
plurality of envelopes a_i(t) to produce modulation spectrums A_i(m,f), where f is
frequency.

Fig. 4 depicts an example 40 illustrating modulation spectrum A_i(m,f) in
terms of power versus frequency. In example 40, articulation power P_A(m,i) is the power
associated with frequencies 2-12.5 Hz, and non-articulation power P_NA(m,i) is the power
associated with frequencies greater than 12.5 Hz. Power P_No(m,i) associated with
frequencies less than 2 Hz is the DC-component of frame m of critical band signal a_i(t).
In this example, articulation power P_A(m,i) is chosen as the power associated with
frequencies 2-12.5 Hz based on the fact that the speed of human articulation is 2-12.5
Hz, and the frequency ranges associated with articulation power P_A(m,i) and
non-articulation power P_NA(m,i) (hereinafter referred to respectively as "articulation
frequency range" and "non-articulation frequency range") are adjacent, non-overlapping
frequency ranges. It should be understood that, for purposes of this application, the term
"articulation power P_A(m,i)" should not be limited to the frequency range of human
articulation or the aforementioned frequency range 2-12.5 Hz. Likewise, the term
"non-articulation power P_NA(m,i)" should not be limited to frequency ranges greater than
the frequency range associated with articulation power P_A(m,i). The non-articulation
frequency range may or may not overlap with or be adjacent to the articulation frequency
range. The non-articulation frequency range may also include frequencies less than the
lowest frequency in the articulation frequency range, such as those associated with the
DC-component of frame m of critical band signal a_i(t).
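
The band split of example 40 can be sketched as follows for one envelope frame. The envelope sampling rate `fs`, the use of squared DFT magnitudes as "power", the one-sided spectrum and the function name `band_powers` are all assumptions made for illustration; the description does not fix these details.

```python
import cmath
import math

def band_powers(frame, fs, lo=2.0, hi=12.5):
    """Split modulation power of one envelope frame into (P_A, P_NA, P_No)."""
    n = len(frame)
    p_a = p_na = p_no = 0.0
    for k in range(n // 2 + 1):                  # one-sided spectrum
        a_k = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        power = abs(a_k) ** 2
        f = k * fs / n                           # bin frequency in Hz
        if f < lo:
            p_no += power                        # below 2 Hz: DC-component
        elif f <= hi:
            p_a += power                         # 2-12.5 Hz: articulation
        else:
            p_na += power                        # above 12.5 Hz: non-articulation
    return p_a, p_na, p_no

# Toy frame: a DC offset plus modulation at 4 Hz and at 20 Hz (fs = 64 Hz).
frame = [1.0 + math.cos(2 * math.pi * 4 * t / 64)
         + 0.5 * math.cos(2 * math.pi * 20 * t / 64) for t in range(64)]
p_a, p_na, p_no = band_powers(frame, fs=64.0)
```

In this toy frame the 4 Hz component lands in the articulation range and the 20 Hz component in the non-articulation range.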

In step 320, for each modulation spectrum A_i(m,f), articulatory analysis
module 26 performs a comparison between articulation power P_A(m,i) and
non-articulation power P_NA(m,i). In this embodiment of articulatory analysis module 26, the
comparison between articulation power P_A(m,i) and non-articulation power P_NA(m,i) is an
articulation-to-non-articulation ratio ANR(m,i). The ANR is defined by the following
equation

$$ANR(m,i) = \frac{P_A(m,i) + \varepsilon}{P_{NA}(m,i) + \varepsilon} \qquad \text{equation (1)}$$

where ε is some small constant value. Other comparisons between articulation power
P_A(m,i) and non-articulation power P_NA(m,i) are possible. For example, the comparison
may be the reciprocal of equation (1), or the comparison may be a difference between
articulation power P_A(m,i) and non-articulation power P_NA(m,i). For ease of discussion,
the embodiment of articulatory analysis module 26 depicted by flowchart 300 will be
discussed with respect to the comparison using ANR(m,i) of equation (1). This should
not, however, be construed to limit the present invention in any manner.
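
The ratio of equation (1) and the two alternative comparisons mentioned above can be written compactly as below. The function name `compare_powers`, the `kind` argument and the default value of `eps` are hypothetical and chosen only for this sketch.

```python
def compare_powers(p_a, p_na, kind="anr", eps=1e-9):
    """Compare articulation power p_a with non-articulation power p_na."""
    if kind == "anr":
        return (p_a + eps) / (p_na + eps)    # equation (1)
    if kind == "reciprocal":
        return (p_na + eps) / (p_a + eps)    # reciprocal of equation (1)
    if kind == "difference":
        return p_a - p_na
    raise ValueError("unknown comparison: " + kind)
```

The small constant keeps the ratio finite for a frame with no non-articulation power; with both powers at zero the ratio evaluates to 1.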

In step 330, ANR(m,i) is used to determine local speech quality LSQ(m)
for frame m. Local speech quality LSQ(m) is determined using an aggregate of the
articulation-to-non-articulation ratio ANR(m,i) across all channels i and a weighting
factor R(m,i) based on the DC-component power P_No(m,i). Specifically, local speech
quality LSQ(m) is determined using the following equation

$$LSQ(m) = \log\left[\sum_{i=1}^{N_c} ANR(m,i)\,R(m,i)\right] \qquad \text{equation (2)}$$

where

$$R(m,i) = \frac{\log(1 + P_{No}(m,i))}{\sum_{k=1}^{N_c} \log(1 + P_{No}(m,k))} \qquad \text{equation (3)}$$

and k is a frequency index.
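
A minimal sketch of step 330 follows, assuming that the weighting factor R(m,i) is each channel's log(1 + P_No(m,i)) normalized by the sum of the same quantity over all channels, which is how equation (3) is read here. The natural logarithm and the function name `local_speech_quality` are further assumptions.

```python
import math

def local_speech_quality(anr_values, p_no_values):
    """LSQ(m) = log(sum_i ANR(m,i) * R(m,i)) for one frame m.

    anr_values[i] and p_no_values[i] hold ANR(m,i) and P_No(m,i) for
    channel i. At least one P_No must be positive, otherwise the
    normalizing sum is zero.
    """
    logs = [math.log(1.0 + p) for p in p_no_values]
    total = sum(logs)
    weights = [v / total for v in logs]          # R(m,i), sums to 1
    return math.log(sum(a * r for a, r in zip(anr_values, weights)))
```

Because the weights sum to one, a frame whose channels all share the same ANR has an LSQ equal to the logarithm of that ANR, whatever the DC-component powers are.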

In step 340, overall speech quality SQ for speech signal s(t) is determined
using local speech quality LSQ(m) and a log power P_s(m) for frame m. Specifically,
speech quality SQ is determined using the following equation

$$SQ = L\{P_s(m)\,LSQ(m)\}_{m=1}^{T} = \left[\sum_{\substack{m=1 \\ P_s(m) > P_{th}}}^{T} P_s^{\lambda}(m)\,LSQ^{\lambda}(m)\right]^{1/\lambda} \qquad \text{equation (4)}$$

where

$$P_s(m) = \log\left[\sum_{t \in m} s^2(t)\right],$$

L is the L_p-norm, T is the total number of frames in speech
signal s(t), λ is any value, and P_th is a threshold for distinguishing between audible signals
and silence. In one embodiment, λ is preferably an odd integer value.

D.S. Kim 3

The output of articulatory analysis module 26 is an assessment of speech
quality SQ over all frames m. That is, speech quality SQ is a speech quality assessment
for speech signal s(t).
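
The frame aggregation of step 340 can be sketched as below, reading equation (4) as an L_p-style norm over the frames whose log power exceeds the silence threshold. The function name `overall_speech_quality`, the restriction to a positive integer exponent `lam` and the default `lam=3` are assumptions made for this sketch.

```python
def overall_speech_quality(p_s, lsq, p_th, lam=3):
    """Aggregate per-frame scores over frames with log power above p_th.

    p_s[m] and lsq[m] hold P_s(m) and LSQ(m); lam is a positive integer.
    """
    total = sum((p * q) ** lam for p, q in zip(p_s, lsq) if p > p_th)
    # With an odd lam the sum keeps its sign, so take the real-valued
    # root of the magnitude and restore the sign afterwards.
    root = abs(total) ** (1.0 / lam)
    return root if total >= 0 else -root

# The second frame is below the threshold and is treated as silence.
sq = overall_speech_quality([2.0, 0.1, 1.0], [1.0, 5.0, 2.0], p_th=0.5)
```

An odd exponent lets frames with negative LSQ(m) lower the overall score instead of being folded into a positive contribution, which may be one reason an odd value is preferred.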

Although the present invention has been described in considerable detail
with reference to certain embodiments, other versions are possible. Therefore, the spirit
and scope of the present invention should not be limited to the description of the
embodiments contained herein.